I have a conda installation of Python 3.7:
```
$ python3 --version
Python 3.7.6
```

pyspark was installed via `pip3 install` (conda has no native package for it).
```
$ conda list | grep pyspark
pyspark                   2.4.5                    pypi_0    pypi
```

Here is what pip3 reports:
```
$ pip3 install pyspark
Requirement already satisfied: pyspark in ./miniconda3/lib/python3.7/site-packages (2.4.5)
Requirement already satisfied: py4j==0.10.7 in ./miniconda3/lib/python3.7/site-packages (from pyspark) (0.10.7)
```

JDK 11 is installed:
```
$ java -version
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
```

When trying to use pyspark, things do not go well. Here is a small test program:
```python
from pyspark.sql import SparkSession
import os, sys

def setupSpark():
    os.environ["PYSPARK_SUBMIT_ARGS"] = "pyspark-shell"
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
    return spark

sp = setupSpark()
df = sp.createDataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.show()
```

This results in:
```
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
```
Here are the full details:
```
$ python3 sparktest.py
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
Traceback (most recent call last):
  File "sparktest.py", line 9, in <module>
    sp = setupSpark()
  File "sparktest.py", line 6, in setupSpark
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 367, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
```

Any suggestions or information on getting a working conda environment would be much appreciated.
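One diagnostic worth running first (a sketch of my own, not from the original post): dump the settings pyspark's launcher consults, since a stale SPARK_HOME or JAVA_HOME pointing at a different Spark/JDK installation can produce exactly this kind of missing-class failure and is cheap to rule out.

```python
# Diagnostic sketch: print the environment variables pyspark's launcher
# reads, plus which java binary the shell would resolve.
import os, shutil

for var in ("JAVA_HOME", "SPARK_HOME", "PYSPARK_SUBMIT_ARGS", "PYSPARK_PYTHON"):
    print(f"{var} = {os.environ.get(var)!r}")
print("java resolved on PATH:", shutil.which("java"))
```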
Update: it may be that pyspark is only available from conda-forge; I have only recently begun using that for conda installs. But it does not change the result:
```
$ conda install -c conda-forge conda-forge::pyspark
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.
```

Re-running the code above still gives us:
```
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
```

Accepted Answer

The following steps were used to run the mini test program in a Conda environment:
Step 1: Create and activate a new Conda environment
```
conda create -n test python=3.7 -y
conda activate test
```

Step 2: Install the latest pyspark and pandas
```
pip install -U pyspark pandas
# Note: I also tested pyspark version 2.4.7
```

Step 3: Run the mini test. (I updated the code to create the DataFrame from a pandas DataFrame instead of a dict.)
```python
from pyspark.sql import SparkSession
import os, sys
import pandas as pd

def setupSpark():
    os.environ["PYSPARK_SUBMIT_ARGS"] = "pyspark-shell"
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
    return spark

sp = setupSpark()
df = sp.createDataFrame(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}))
df.show()
```
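As an aside, pandas is not strictly required here: `createDataFrame` also accepts a list of row tuples plus a list of column names. A standalone equivalent (my variant, not from the original answer):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
# A list of row tuples plus column names builds the same two-column
# DataFrame without the pandas dependency.
df = spark.createDataFrame([(1, 4), (2, 5), (3, 6)], ["a", "b"])
df.show()
```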
Step 4: Enjoy the output

```
+---+---+
|  a|  b|
+---+---+
|  1|  4|
|  2|  5|
|  3|  6|
+---+---+
```

The Java version I used when installing pyspark:
```
$ java -version
java version "15.0.2" 2021-01-19
Java(TM) SE Runtime Environment (build 15.0.2+7-27)
Java HotSpot(TM) 64-Bit Server VM (build 15.0.2+7-27, mixed mode, sharing)
```
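To confirm that a fresh environment really picks up the intended pyspark build and JVM, a quick sanity check along these lines can be run before the test program (a sketch, not part of the original answer):

```python
# Sanity-check sketch: report which pyspark copy the interpreter imports
# and which Java the gateway process would use.
import os, shutil
import pyspark

print("pyspark", pyspark.__version__, "loaded from", pyspark.__file__)
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))
print("java on PATH:", shutil.which("java"))
```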